The data needed for this exercise is taken from a similar course, in particular it comes from this publication. We will work with the output from the differential expression analysis of KO vs WT. The data files should be in a data/ subdirectory in your R working directory. You can download the data/ directory with files here.
The idea is that you should go through all the steps in this tutorial and keep track of your code and plots using Rmarkdown. Also include comments and answers to questions in this document. Today we will try to generate a PDF file instead of a HTML file, and this will be the report that you should hand in. To do this, you need to modify the Rmarkdown header a bit, see examples here.
Make a new directory for this exercise and add the data/ directory to it. Start R and use setwd() to change the working directory to the one you created.
We will use the following R packages in this exercise:
library(knitr)
library(topGO)
library(biomaRt)
library(piano)
library(NMF)
If you are not able to load any of these packages, try to install them, either using biocLite() (first you need to run source("http://www.bioconductor.org/biocLite.R")) or install.packages(). If something fails, try to understand the error message and fix it. If you get stuck, ask for help :)
diffExpRes <- read.delim("data/results_DE.txt", stringsAsFactors=F)
head(diffExpRes[,c(1,3,5,8,9)]) # skip some columns
## ensembl_gene_id mgi_symbol logFC PValue FDR
## 1 ENSMUSG00000034810 Scn7a -2.860822 4.613775e-45 6.456978e-41
## 2 ENSMUSG00000024799 Tm7sf2 -2.166921 9.630122e-44 6.738678e-40
## 3 ENSMUSG00000027200 Sema6d -1.836609 5.145875e-42 2.400551e-38
## 4 ENSMUSG00000022483 Col2a1 1.880766 2.420791e-41 7.273966e-38
## 5 ENSMUSG00000022231 Sema5a -3.039413 2.598773e-41 7.273966e-38
## 6 ENSMUSG00000023832 Acat2 -1.934613 8.941421e-41 2.085587e-37
Question 1: How many genes are significant at a cutoff of FDR<0.001?
We will start by performing overrepresentation analysis (a.k.a. list enrichment analysis, …) by using Enrichr and DAVID. Both use a scoring method similar to the Hypergeomtric/Fisher’s exact test. To do such a test, we need to have a list of interesting (e.g. differentially expressed) genes and a list of the background (also known as universe). However, both Enrichr and DAVID have their own background lists, so we do not need to specify them explicitly. First, let’s make a list of interesting genes!
write.table function to print the top significant genes to copy paste into some web browser tools:sel <- diffExpRes$ensembl_gene_id[diffExpRes$FDR<0.001]
write.table(sel, quote=F, row.names=F, col.names=F)
Question 2: What is the alternative to the Fisher Exact P-Value used by DAVID?
Question 3: Were all gene Ensembl IDs recognized? How many were unmapped?
Question 4: What seems to be the main functions of the top significant genes (i.e. in summary what are the significantly enriched gene-sets)? Does this make sense considering the experimental design?
sel object. Also, DAVID will probably warn you at some point that the genes map to multiple species. Select the correct species and also go to the background tab and select the correct background.Question 5: Were all gene symbols recognized? How many were unmapped?
Question 6: How many genes in our list overlap with the GO-term Lipid biosynthesis? (Hint: check e.g. the Functional Annotation Clustering)
Question 7: Does this list give the same results as using the Ensembl IDs?
As an alternative to DAVID we can try Enrichr. This page takes human gene symbols as input, so first we need to translate our mouse Ensembl IDs to that. This can be done in many ways, one way is to use the biomaRt package (find a user guide here or type vignette("biomaRt")):
# Use biomaRt to translate the mouse genes into human orthologs:
# Select the Ensembl BioMart database
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
# Select the mouse dataset and update the Mart object:
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_ensembl_gene"), # this is what we want to extract
filters="ensembl_gene_id", # this determines the filter
values=sel, # this are the values to filter for (get only the genes in our list)
mart=ensembl)
head(bm)
## ensembl_gene_id hsapiens_homolog_ensembl_gene
## 1 ENSMUSG00000000093 ENSG00000121068
## 2 ENSMUSG00000000120 ENSG00000064300
## 3 ENSMUSG00000000184 ENSG00000118971
## 4 ENSMUSG00000000223 ENSG00000102385
## 5 ENSMUSG00000000253 ENSG00000137198
## 6 ENSMUSG00000000416 ENSG00000077063
As you can see, the above code gave us a map between mouse and human ensembl gene IDs. But this is not exactly what we wanted, we need human gene symbols.
attributes argument to the proper setting.Use the function listAttributes(ensembl) to see all possible options. Hint: if there are too many options to go through you can use grep to pull out the relevant ones, e.g.:
tmp <- listAttributes(ensembl)
tmp[grep("Human",tmp[,2]),]
If you are working in RStudio you can also use the convenient command View and then search for human, e.g.:
View(listAttributes(ensembl))
Try it also for the bm object. For future reference, note that you can also use listDatasets(ensembl) to see which datasetes you can choose from, try it!
Question 8: How many of the genes in our selected list have gene symbols?
duplicated and unique functions:# A vector to test stuff on:
tmp <- c(1,2,2,3,4,4,4,5)
# Get unique elements:
unique(tmp)
## [1] 1 2 3 4 5
# Which elements are duplicated?
duplicated(tmp)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
# How many duplicates are there:
sum(duplicated(tmp))
## [1] 3
# Which are the duplicated elements:
tmp[duplicated(tmp)]
## [1] 2 4 4
# Which are the unique duplicated elements:
unique(tmp[duplicated(tmp)])
## [1] 2 4
Make sure you understand how these two functions work and what they do, then move ahead with the following steps and questions:
Question 9: Are there duplicated Ensembl IDs or gene symbols in the list for Enrichr (
bm)? How can this affect the results that we get?
getBM again, this time also adding information about % identity between mouse and human genes.?boxplot if you are unsure about this. Try a histogram as well (hist())!Question 10: Looking at the boxplot, are there any reasons to be concerned?
Question 11: Which gene in the list has the lowest % identity?
Question 12: Are the results similar to those we got from DAVID?
Question 13: What do the edges (links) in the network view on the Enrichr webpage denote?
Question 14: We have this far used a cutoff of FDR<0.001. Try a few different cutoffs and rerun DAVID and Enrichr. How different are the results?
Question 15: For the Enrichr and DAVID results so far, can we know whether the identified functions are active/inactive or up/down-regulated in KO vs control? If not, how can we go about to get a clue about that?
Above, we have looked at some different types of gene-set collections, but a lot of focus has been on GO-terms.
Question 16: What are the three main domains/ontologies?
Question 17: By whom and how are GO-terms maintained/defined/updated?
Question 18: What is the evidence code?
Question 19: How are GO-terms connected to each other?
We can use e.g. the topGO package to visualize gene-set analysis results on the GO hierarchy network (topGO manual).
# Format for topgo - remove genes without ensembl id:
d_topgo <- diffExpRes[!is.na(diffExpRes$ensembl_gene_id),]
rownames(d_topgo) <- d_topgo$ensembl_gene_id
# Create the topGOdata object:
GOobj <- new("topGOdata",
description = "Simple session", # just a name, can be changed
ontology = "BP", # set to BP, MF, or CC
allGenes = setNames(diffExpRes$FDR, diffExpRes$ensembl_gene_id), # named numeric vector
geneSel = function(x) x<0.01, # a function to select genes from the allGenes vector
nodeSize = 10, # remove GO-terms with less than this number of genes
mapping = "org.Mm.eg.db", # annotation database
ID = "ensembl", # gene ID used as names in allGenes vector
annot = annFUN.org)
# Run GO-term enrichment analysis:
resultFisher <- runTest(GOobj, algorithm = "classic", statistic = "fisher")
# Plot results using the GO hierarchical DAG:
showSigOfNodes(GOobj, score(resultFisher), firstSigNode=15, useInfo ='all')
If you save the plot as a PDF it will be easier to zoom in and read the node text. (Sometimes the plotting misbehaves in RStudio, try running grid.newpage(), plot.new(), and/or par(mfcol=c(1,1)) if plotting looks weird.)
Question 20: What are the circles and squares denoting?
Question 21: Which is the most significant BP GO-term?
Question 22: Which is the most significant CC GO-term?
Piano is an R-package for carrying out gene-set analysis (see more info here).
gscGO <- loadGSC("data/GSC/c5.bp.v5.2.symbols.gmt")
gscGO
## First 50 (out of 4653) gene set names:
## [1] "GO_REGULATION_O..." "GO_LACTATE_TRAN..." "GO_POSITIVE_REG..."
## [4] "GO_DETECTION_OF..." "GO_CARDIAC_CHAM..." "GO_CALCIUM_ION_..."
## [7] "GO_DNA_DEPENDEN..." "GO_TRNA_AMINOAC..." "GO_CIRCADIAN_RH..."
## [10] "GO_PHOSPHATIDYL..." "GO_SPINAL_CORD_..." "GO_REGULATION_O..."
## [13] "GO_PLATELET_DER..." "GO_GLUCURONATE_..." "GO_CELLULAR_RES..."
## [16] "GO_REGULATION_O..." "GO_POSITIVE_REG..." "GO_CEREBRAL_COR..."
## [19] "GO_POSITIVE_REG..." "GO_NEGATIVE_REG..." "GO_POTASSIUM_IO..."
## [22] "GO_L_PHENYLALAN..." "GO_REGULATION_O..." "GO_CARDIAC_MUSC..."
## [25] "GO_NEGATIVE_REG..." "GO_MOVEMENT_IN_..." "GO_REGULATION_O..."
## [28] "GO_APICAL_PROTE..." "GO_REGULATION_O..." "GO_FOREBRAIN_NE..."
## [31] "GO_POSITIVE_REG..." "GO_NEUROMUSCULA..." "GO_MITOTIC_CYTO..."
## [34] "GO_NEGATIVE_REG..." "GO_SMAD_PROTEIN..." "GO_CYTOPLASMIC_..."
## [37] "GO_MEIOTIC_CHRO..." "GO_POSITIVE_REG..." "GO_REGULATION_O..."
## [40] "GO_RNA_DEPENDEN..." "GO_REGULATION_O..." "GO_REGULATION_O..."
## [43] "GO_DENDRITE_DEV..." "GO_REGULATION_O..." "GO_G_PROTEIN_CO..."
## [46] "GO_MEMORY" "GO_NEURON_DEVEL..." "GO_REGULATION_O..."
## [49] "GO_ENDOTHELIAL_..." "GO_POSITIVE_REG..."
##
## First 50 (out of 15578) gene names:
## [1] "PDE1B" "NR4A2" "MAOB" "DRD4" "TACR3" "PARK2"
## [7] "COMT" "HTR1A" "GPR37" "HPRT1" "CHRNB2" "SNCA"
## [13] "SLC6A3" "ABAT" "DRD1" "PNKD" "PARK7" "SLC16A8"
## [19] "SLC16A5" "SLC16A4" "SLC5A12" "EMB" "SLC16A1" "SLC16A7"
## [25] "SLC16A12" "SLC16A13" "SLC16A6" "SLC16A3" "SLC16A11" "POLR2C"
## [31] "POLR2J" "CTDP1" "RDBP" "COBRA1" "RSF1" "POLR2B"
## [37] "POLR2D" "POLR2G" "POLR2F" "POLR2A" "SNW1" "CHD1"
## [43] "POLR2K" "CDK9" "JUN" "POLR2E" "CCNT1" "LEF1"
## [49] "POLR2H" "GTF2F2"
##
## Gene set size summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 17.0 34.0 111.5 92.0 1984.0
##
## No additional info available.
The printed info is always good to check. Did the genes get read as genes and the gene-sets as gene-sets? Do the gene-set sizes seem reasonable? This is a way to sanity check that the loading from a file worked as expected. Usually, reading data from a file is a weak point in any workflow and it is good to carefully check that the data in the file got read as intended. These gene-sets are annotated with human gene symbols, which we currently don’t have in our data.
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name"),
filters="ensembl_gene_id",
values=diffExpRes$ensembl_gene_id, # note that now we use all genes here!
mart=ensembl)
bm <- bm[bm[,2]%in%unique(unlist(gscGO)),] # filter for genes in the GO gene-set collection
head(bm)
## ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000001 GNAI3
## 2 ENSMUSG00000000028 CDC45
## 6 ENSMUSG00000000078 KLF6
## 7 ENSMUSG00000000085 SCMH1
## 8 ENSMUSG00000000088 COX5A
## 9 ENSMUSG00000000093 TBX2
nrow(bm)
## [1] 10489
Question 23: Are there any duplicates in
bm? How many? What does this mean?
bm:bm <- bm[!duplicated(bm[,1]),]
nrow(bm)
Question 24: How many rows did we remove?
Question 25: How many unique mouse Ensembl IDs could we map to human orthologs? How many unique human orhtologs do we map to?
Note that multiple ENSMUS IDs still can map to the same gene symbol!
# merge diffExpRes and bm:
d_piano <- merge(diffExpRes, bm, by="ensembl_gene_id", all.x=T, sort=F)
# get geneNames, pvals and logfc:
geneNames <- d_piano$hsapiens_homolog_associated_gene_name
pvals <- d_piano$FDR
names(pvals) <- geneNames
logfc <- d_piano$logFC
names(logfc) <- geneNames
head(), View() and length()) so that you know how they look and what they contain.We now have the gene-level data in a convenient format and we are ready to continue with the analysis.
First, let’s use piano to perform a hypergeometric test, similar to what we did with Enrichr, DAVID, and topGO.
# Get the gene-list:
sel <- names(pvals)[pvals<0.001]
sel <- unique(sel[!is.na(sel)])
# Run the analysis:
res <- runGSAhyper(genes=sel, universe=unique(names(pvals)), gsc=gscGO, gsSizeLim=c(20, 200))
# Visualize results:
networkPlot(res, class="non", adjusted=T, significance=0.0001, ncharLabel=Inf,
nodeSize=c(3,20), edgeWidth=c(1,15), cexLabel=0.7, overlap=20,
scoreColors=c("greenyellow","darkgreen"))
Here, we use the
networkPlot function which draws a network for the most significant gene-sets.
Question 26: What does
gsSizeLim=c(20, 200)in the code above mean?
Question 27: What do the colors, node-sizes, and edges denote?
Question 28: Are the results in line with those from topGO, DAVID, and Enrichr?
Next, we will use piano to perform gene-set analysis, i.e. not the enrichment of a gene-list based on a cut-off, but using all available gene-level statistics. We will use signed (to keep track of fold-change) -log10-pvalues to rank the gene-sets, i.e. -log10(pvals)*sign(logfc).
gsaRes <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="fgsea", gsc=gscGO, gsSizeLim=c(20, 200))
networkPlot(gsaRes, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
nodeSize=c(3,15), edgeWidth=c(1,8), cexLabel=0.7, overlap=20,
scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))
Question 29: Look at the network plot, could gene overlap between you gene-sets bias the result interpretation in any way?
The network plot can be nice for visualizing your results, but the function GSAsummaryTable can be used to get a table of the all the GSA results.
View() command:View(GSAsummaryTable(gsaRes))
Question 30: What is the top significant GO-term affected by down-regulation and up-regulation, respectively?
data/GSC/c2.cp.v5.2.symbols.gmt, which contains pathway gene-sets from MSigDB, as a gene-set collection (load it with loadGSC()).geneSetStat="mean") instead of fgsea.For example it could look something similar to this:
Question 31: Are these results in line with previous results that we have seen in this exercise? Do you feel there is a consistent pattern?
It is always good to go back to the gene-level while looking at your final GSA results. Here we will use a heatmap for visualizing gene-level data for selected gene-sets (using one of many heatmap functions available for R).
# Get genes for a specific gene-set:
selGenes <- names(geneSetSummary(gsaRes, "GO_STEROID_BIOSYNTHETIC_PROCESS")$geneLevelStats)
# Merge with the diffexp result table:
selTab <- merge(cbind(selGenes), d_piano[,c(1,3,5,9,10)],
by.x=1, by.y="hsapiens_homolog_associated_gene_name", all.x=T)
selTab$log10FDR <- -log10(selTab$FDR)*sign(selTab$logFC)
# Display this table:
kable(selTab)
| selGenes | ensembl_gene_id | mgi_symbol | logFC | FDR | log10FDR |
|---|---|---|---|---|---|
| ACAA2 | ENSMUSG00000036880 | Acaa2 | 0.3227222 | 0.0729742 | 1.1368305 |
| ACBD3 | ENSMUSG00000026499 | Acbd3 | -0.2117000 | 0.8284461 | -0.0817357 |
| ACLY | ENSMUSG00000020917 | Acly | -1.0100942 | 0.0000000 | -14.8512572 |
| ACOT8 | ENSMUSG00000017307 | Acot8 | 0.1464229 | 0.8183638 | 0.0870536 |
| AKR1B15 | ENSMUSG00000061758 | Akr1b10 | 0.4710835 | 0.2556798 | 0.5923037 |
| AKR1B15 | ENSMUSG00000029762 | Akr1b8 | 0.4084569 | 0.2292405 | 0.6397086 |
| AKR1B15 | ENSMUSG00000061758 | Akr1b10 | 0.4710835 | 0.2556798 | 0.5923037 |
| AKR1B15 | ENSMUSG00000029762 | Akr1b8 | 0.4084569 | 0.2292405 | 0.6397086 |
| AKR1C3 | ENSMUSG00000021213 | Akr1c13 | 0.1587939 | 0.9247777 | 0.0339627 |
| AKR1C4 | ENSMUSG00000021213 | Akr1c13 | 0.1587939 | 0.9247777 | 0.0339627 |
| AMACR | ENSMUSG00000022244 | Amacr | 0.1090078 | 0.8615007 | 0.0647444 |
| ARV1 | ENSMUSG00000031982 | Arv1 | -0.0702522 | 0.9464355 | -0.0239090 |
| C14orf1 | ENSMUSG00000021252 | 0610007P14Rik | -0.9167006 | 0.0000000 | -9.7075763 |
| CACNA1H | ENSMUSG00000024112 | Cacna1h | 0.6248563 | 0.0628600 | 1.2016254 |
| CNBP | ENSMUSG00000030057 | Cnbp | 0.0848649 | 0.8070392 | 0.0931054 |
| CYB5R1 | ENSMUSG00000026456 | Cyb5r1 | 0.4548193 | 0.1030950 | 0.9867626 |
| CYB5R3 | ENSMUSG00000018042 | Cyb5r3 | -0.2805545 | 0.1706601 | -0.7678679 |
| CYP11A1 | ENSMUSG00000032323 | Cyp11a1 | 0.6369639 | 0.5239671 | 0.2806960 |
| CYP27A1 | ENSMUSG00000026170 | Cyp27a1 | 0.1075853 | 0.9047557 | 0.0434687 |
| CYP2R1 | ENSMUSG00000030670 | Cyp2r1 | -0.2493062 | 0.8816242 | -0.0547165 |
| CYP39A1 | ENSMUSG00000023963 | Cyp39a1 | -1.0522696 | 0.0000000 | -9.8271986 |
| CYP3A4 | ENSMUSG00000029727 | Cyp3a13 | 0.0295973 | 0.9806181 | 0.0085001 |
| CYP46A1 | ENSMUSG00000021259 | Cyp46a1 | -0.4376909 | 0.6736854 | -0.1715429 |
| CYP51A1 | ENSMUSG00000001467 | Cyp51 | -1.6531937 | 0.0000000 | -20.8889340 |
| CYP7B1 | ENSMUSG00000039519 | Cyp7b1 | -0.0504172 | 0.8895779 | -0.0508160 |
| DHCR24 | ENSMUSG00000034926 | Dhcr24 | -1.7828603 | 0.0000269 | -4.5701737 |
| DHCR7 | ENSMUSG00000058454 | Dhcr7 | -0.8584124 | 0.0000000 | -8.8078473 |
| EBP | ENSMUSG00000031168 | Ebp | -0.5232713 | 0.0206938 | -1.6841595 |
| FDFT1 | ENSMUSG00000021273 | Fdft1 | -1.4720779 | 0.0000000 | -22.7374916 |
| FDPS | ENSMUSG00000059743 | Fdps | -2.0806552 | 0.0000232 | -4.6340487 |
| FDX1 | ENSMUSG00000032051 | Fdx1 | 0.3163207 | 0.4555077 | 0.3415043 |
| FDXR | ENSMUSG00000018861 | Fdxr | -0.2715724 | 0.4486059 | -0.3481350 |
| GGPS1 | ENSMUSG00000021302 | Ggps1 | -0.1880611 | 0.7162604 | -0.1449291 |
| HINT2 | ENSMUSG00000028470 | Hint2 | 0.1070335 | 0.8354226 | 0.0780938 |
| HMGCR | ENSMUSG00000021670 | Hmgcr | -1.9043148 | 0.0000000 | -22.9236949 |
| HMGCS1 | ENSMUSG00000093930 | Hmgcs1 | -2.1349687 | 0.0000000 | -21.3281530 |
| HMGCS2 | ENSMUSG00000027875 | Hmgcs2 | 0.4020805 | 0.0251885 | 1.5987971 |
| HSD11B2 | ENSMUSG00000031891 | Hsd11b2 | 1.0354925 | 0.1483534 | 0.8287025 |
| HSD17B11 | ENSMUSG00000029311 | Hsd17b11 | 0.2584276 | 0.2948881 | 0.5303427 |
| HSD17B12 | ENSMUSG00000027195 | Hsd17b12 | -0.4645855 | 0.0014895 | -2.8269590 |
| HSD17B14 | ENSMUSG00000030825 | Hsd17b14 | -0.4382260 | 0.7236379 | -0.1404787 |
| HSD17B4 | ENSMUSG00000024507 | Hsd17b4 | -0.0393758 | 0.9229765 | -0.0348093 |
| HSD17B7 | ENSMUSG00000026675 | Hsd17b7 | -1.4288825 | 0.0000000 | -9.6138912 |
| HSD3B7 | ENSMUSG00000042289 | Hsd3b7 | 0.0582406 | 0.9228457 | 0.0348709 |
| IDI1 | ENSMUSG00000058258 | Idi1 | -1.8497690 | 0.0000000 | -7.7800272 |
| INSIG1 | ENSMUSG00000045294 | Insig1 | -1.4724813 | 0.0000000 | -20.1443378 |
| INSIG2 | ENSMUSG00000003721 | Insig2 | -0.0976972 | 0.8562685 | -0.0673900 |
| LBR | ENSMUSG00000004880 | Lbr | -0.2599271 | 0.3926051 | -0.4060441 |
| LSS | ENSMUSG00000033105 | Lss | -2.3815805 | 0.0000018 | -5.7332717 |
| MED1 | ENSMUSG00000018160 | Med1 | 0.1449697 | 0.5948939 | 0.2255605 |
| MSMO1 | ENSMUSG00000031604 | Msmo1 | -1.9228029 | 0.0000000 | -30.1111731 |
| MVK | ENSMUSG00000041939 | Mvk | -1.0533312 | 0.0000000 | -12.1796821 |
| NSDHL | ENSMUSG00000031349 | Nsdhl | -1.3307947 | 0.0000000 | -14.8501597 |
| PBX1 | ENSMUSG00000052534 | Pbx1 | 0.0237174 | 0.9547740 | 0.0200994 |
| PMVK | ENSMUSG00000027952 | Pmvk | -1.0456293 | 0.0000000 | -10.2062817 |
| PRKAA1 | ENSMUSG00000050697 | Prkaa1 | 0.7674139 | 0.3298035 | 0.4817447 |
| PRKAA2 | ENSMUSG00000028518 | Prkaa2 | 1.0590779 | 0.0226386 | 1.6451495 |
| PRKAG2 | ENSMUSG00000028944 | Prkag2 | -0.5878811 | 0.0552485 | -1.2576795 |
| SCARB1 | ENSMUSG00000037936 | Scarb1 | -0.3765647 | 0.0507384 | -1.2946633 |
| SCP2 | ENSMUSG00000028603 | Scp2 | -0.2126157 | 0.8743149 | -0.0583321 |
| SDR42E1 | ENSMUSG00000034308 | Sdr42e1 | 0.3786522 | 0.6575808 | 0.1820508 |
| SQLE | ENSMUSG00000022351 | Sqle | -1.2502197 | 0.0000000 | -14.1391715 |
| SRD5A3 | ENSMUSG00000029233 | Srd5a3 | 0.0355565 | 0.9386586 | 0.0274923 |
| STAR | ENSMUSG00000031574 | Star | -0.1712996 | 0.8121067 | -0.0903869 |
| STARD3 | ENSMUSG00000018167 | Stard3 | -0.2927755 | 0.1854065 | -0.7318751 |
| STARD5 | ENSMUSG00000046027 | Stard5 | 0.4838561 | 0.0045057 | 2.3462366 |
| TM7SF2 | ENSMUSG00000024799 | Tm7sf2 | -2.1669206 | 0.0000000 | -39.1714253 |
| TSPO | ENSMUSG00000041736 | Tspo | 0.0877881 | 0.8059029 | 0.0937173 |
| WNT4 | ENSMUSG00000036856 | Wnt4 | 1.8713034 | 0.0014986 | 2.8243258 |
Question 32: Does the table look like you expect it to (did the human to mouse gene symbol matching work, are the signed log10-FDRs consistent with the logFC and FDR columns)?
# Read in the count data:
tableCounts <- read.table(file="data/tableCounts_with_location.tab", sep="\t", header=TRUE, skip=1)
rownames(tableCounts) <- tableCounts[,1]
tableCounts <- tableCounts[,7:12]
colnames(tableCounts) <- c("KO_1","KO_2", "KO_3", "WT_1", "WT_2", "WT_3");
# Merge count data with the selTab:
selTab <- merge(selTab, tableCounts, by.x="ensembl_gene_id", by.y=0, all.x=T)
# Plot a heatmap:
# Row annotation:
rowann <- list(Significant=ifelse(selTab$FDR<0.05,"yes","no"), Regulation=ifelse(selTab$logFC>0,"Up","Down"))
rowannCol <- list(Significant=c("black","green"), Regulation=c("blue","red"))
# Row labels:
rowlabs <- selTab$mgi_symbol
rowlabs[duplicated(rowlabs)] <- paste(rowlabs[duplicated(rowlabs)]," ")
# Plot:
aheatmap(selTab[,7:12], scale="row", annRow=rowann, annColors=rowannCol,
labRow=rowlabs,
txt=selTab[,7:12])
Note that here we scale the rows so the colors will be visualizing the difference for each gene specifically, but is not comparable across genes. Therefore we also include the actual counts in each heatmap cell.
Question 33: Why are the rows and columns reordered?
Question 34: Does it look “better”, what is the drawback/benefit?
Question 35: Given the count data for these genes, does the differential expression results make sense, as indicated by the column annotation columns? Any specific genes that you are concerned about?
Question 36: Would you say from looking at the heatmap, that in general steroid biosynthesis is down-regulated or up-regulated? Does your answer reflect the GSA results as seen in the network plot above?
Question 37: Are these genes actually involved in steroid biosynthesis? Are you confident enough from the data and analysis results to say anything about how YAP/TAZ knockout affects this biological process?
This page was generated using the following R session:
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 Rgraphviz_2.18.0 org.Mm.eg.db_3.4.0
## [4] NMF_0.20.6 cluster_2.0.5 rngtools_1.2.4
## [7] pkgmaker_0.22 registry_0.3 piano_1.14.0
## [10] biomaRt_2.30.0 topGO_2.26.0 SparseM_1.72
## [13] GO.db_3.4.0 AnnotationDbi_1.36.0 IRanges_2.8.0
## [16] S4Vectors_0.12.0 Biobase_2.34.0 graph_1.52.0
## [19] BiocGenerics_0.20.0 knitr_1.14
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 lattice_0.20-34 relations_0.6-6
## [4] gtools_3.5.0 assertthat_0.1 digest_0.6.10
## [7] foreach_1.4.3 gridBase_0.4-7 slam_0.1-37
## [10] plyr_1.8.4 chron_2.3-47 RSQLite_1.0.0
## [13] evaluate_0.10 highr_0.6 ggplot2_2.1.0
## [16] gplots_3.0.1 data.table_1.9.6 gdata_2.17.0
## [19] rmarkdown_1.1 sets_1.0-16 BiocParallel_1.8.0
## [22] stringr_1.1.0 igraph_1.0.1 RCurl_1.95-4.8
## [25] munsell_0.4.3 fgsea_1.0.0 marray_1.52.0
## [28] htmltools_0.3.5 tibble_1.2 gridExtra_2.2.1
## [31] codetools_0.2-15 matrixStats_0.51.0 XML_3.98-1.4
## [34] bitops_1.0-6 xtable_1.8-2 gtable_0.2.0
## [37] DBI_0.5-1 magrittr_1.5 formatR_1.4
## [40] scales_0.4.0 KernSmooth_2.23-15 stringi_1.1.2
## [43] reshape2_1.4.1 doParallel_1.0.10 limma_3.30.0
## [46] fastmatch_1.0-4 iterators_1.0.8 tools_3.3.1
## [49] yaml_2.1.13 colorspace_1.2-7 caTools_1.17.1